Using the linear regression algorithm, we can approximate a function that models a system. In the case of linear regression, we hope the data is roughly linear so that it can be modeled by a line. We can then use this function to make predictions for new inputs.
In [155]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
For example, consider the plot below. The scatter points are randomly generated, but for this example let's imagine we are analyzing the prices of homes: the x axis represents the square footage of a house and the y axis is the price of the house. It looks like there is a pretty clear correlation between square footage and price. Let's create a function that, given a square footage, can predict the price of a house: this is called a regression function.
In [156]:
N = 100
x = np.random.rand(N) * 6      # "square footage" inputs
y = x + np.random.rand(N)      # "prices": roughly linear in x, plus noise
plt.scatter(x, y)
plt.plot([0, 6], [0.5, 6.2])   # an eyeballed line through the data
Out[156]:
The goal of a regression function is to find the line that fits the points best. For the scatter plot above, we can approximate a function that minimizes the error between the function's value at $x_n$ and the actual value $y_n$. The error at a single point can be measured by:
$$Error = y_n - (mx_n+b)$$
We will then square the error (see the note in the appendix about why we are using the squared error):
$$Error^2 = \left(y_n - (mx_n+b)\right)^2$$
If we sum the squared errors over each point in our scatter, we have our error function:
$$SE_{line} = \sum \left(y_n - (mx_n+b)\right)^2$$
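To make this concrete, here is a minimal sketch that evaluates $SE_{line}$ directly from this definition, using the x and y generated above and a guessed line (m=1 and b=0.5 are illustrative assumptions, not the optimal fit):
In [ ]:
# A minimal sketch: compute SE_line straight from the definition above.
# m_guess and b_guess are arbitrary illustrative values, not the best fit.
m_guess, b_guess = 1.0, 0.5
se_direct = ((y - (m_guess * x + b_guess))**2).sum()
se_direct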
$SE_{line}$ is the sum of the squared errors of our regression line. Therefore, the larger this number, the less accurate the line, and by the same logic, the smaller the number, the more accurate the regression line. So, if we find $m$ and $b$ that give the minimum possible value of $SE_{line}$, we know we have the most accurate regression line. Before we discuss finding the minimum of $SE_{line}$, let's first do some algebra to get $SE_{line}$ into a more convenient form.
Expand the binomials...
$$SE_{line} = \sum \left(y_n^2 - 2mx_ny_n - 2by_n + m^2x_n^2 + 2mbx_n + b^2\right)$$
Group similar terms:
$$SE_{line} = \left(\sum y_n^2\right) - 2m\left(\sum x_ny_n\right) - 2b\left(\sum y_n\right) + m^2\left(\sum x_n^2\right) + 2mb\left(\sum x_n\right) + nb^2$$
Substitute the terms in parentheses with mean representations....
As you can see in the grouped form above, the first term in parentheses is just $n$ times the arithmetic mean of $y^2$:
$$\sum y_n^2 = n\,\hat{y^2}, \qquad \hat{y^2} = \frac{1}{n}\sum y_n^2$$
We can do this for each term in $SE_{line}$ and then substitute the means back into it:
$$SE_{line} = n\hat{y^2} - 2mn\,\widehat{xy} - 2bn\hat{y} + m^2n\hat{x^2} + 2mbn\hat{x} + nb^2$$
In [157]:
def se_line(n, m, b, y_2_hat, x_y_hat, y_hat, x_2_hat, x_hat):
    # SE_line written in terms of the data means ("hat" quantities) derived above
    val = n*y_2_hat - 2*m*(n*x_y_hat) - 2*b*(n*y_hat) + m**2*(n*x_2_hat) + 2*m*b*(n*x_hat) + n*b**2
    return val
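As a quick sanity check (a sketch that reuses the x and y from above and the same illustrative m=1, b=0.5), the mean-based form should agree with the direct sum of squared errors:
In [ ]:
# Sanity check (sketch): the mean-form se_line should match the direct sum of
# squared errors for any m and b; here m=1.0, b=0.5 are arbitrary test values.
n = x.shape[0]
direct = ((y - (1.0 * x + 0.5))**2).sum()
mean_form = se_line(n, 1.0, 0.5, (y**2).mean(), (x*y).mean(), y.mean(), (x**2).mean(), x.mean())
np.allclose(direct, mean_form)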
We need to find the values of $m$ and $b$ such that $SE_{line}$ is minimized. To find the minimum value of a function $f(x)$, we simply solve for where its derivative equals $0$. In our case we have a function of two variables, so its graph is a surface, like the image below:
In [158]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import math
In [159]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
m = np.arange(-N, N)
b = np.arange(-N, N)
y_2_hat = (y**2).mean()
x_y_hat = (x * y).mean()
y_hat = y.mean()
x_2_hat = (x**2).mean()
x_hat = x.mean()
n = x.shape[0]
# Evaluate the squared error over a grid of candidate slopes (X) and intercepts (Y)
X, Y = np.meshgrid(m, b)
Z = se_line(n, X, Y, y_2_hat, x_y_hat, y_hat, x_2_hat, x_hat)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
We can see intuitively from the surface above, which plots the error for each choice of $m$ and $b$, that if we find the point where the error is minimum, we have found the $m$ and $b$ that are optimal for our regression line.
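As a rough illustration, here is a sketch that reuses the grid Z computed above. The integer-valued grid is coarse, so this only approximates the optimum, but it shows the idea of reading off the lowest point of the surface:
In [ ]:
# Rough sketch: read off the grid point with the smallest error. The grid is
# coarse (integer m and b), so this is only an approximation of the optimum;
# the calculus below gives the exact values.
i, j = np.unravel_index(np.argmin(Z), Z.shape)
(X[i, j], Y[i, j], Z[i, j])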
We can use calculus to find the minimum of the error function by setting the partial derivatives with respect to $m$ and $b$ equal to $0$. We know that where the derivative of a differentiable function equals $0$, the function has either a maximum or a minimum.
$$\frac{\partial SE_{line}}{\partial m} = -2n\,\widehat{xy} + 2mn\hat{x^2} + 2bn\hat{x} = 0$$
$$\frac{\partial SE_{line}}{\partial b} = -2n\hat{y} + 2mn\hat{x} + 2nb = 0$$
Now we have a system of equations whose solution for $m$ and $b$ will give us the smallest error. So if we solve for $m$ and $b$ we will have minimized the squared error. Let's simplify the expressions by dividing by $2n$:
$$m\hat{x^2} + b\hat{x} - \widehat{xy} = 0$$
$$m\hat{x} + b - \hat{y} = 0$$
We can rearrange the terms...
$$m\hat{x^2} + b\hat{x} = \widehat{xy}$$
$$m\hat{x} + b = \hat{y}$$
Solve the system... The second equation gives $b = \hat{y} - m\hat{x}$; substituting into the first and solving for $m$ yields
$$m = \frac{\hat{x}\,\hat{y} - \widehat{xy}}{\hat{x}^2 - \hat{x^2}}, \qquad b = \hat{y} - m\hat{x}$$
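If you prefer to let a computer do the algebra, here is a sketch using sympy (assuming sympy is available; the symbol names simply mirror the "hat" means above) that differentiates the mean form of $SE_{line}$ and solves the resulting system, reproducing the formulas above:
In [ ]:
# Sketch: derive the closed form symbolically with sympy (assumes sympy is installed).
# The symbols stand in for the data means used above.
import sympy as sp
m_s, b_s = sp.symbols('m b', real=True)
n_s, y2, xy, ybar, x2, xbar = sp.symbols('n yhat2 xyhat yhat xhat2 xhat', real=True)
SE = n_s*y2 - 2*m_s*n_s*xy - 2*b_s*n_s*ybar + m_s**2*n_s*x2 + 2*m_s*b_s*n_s*xbar + n_s*b_s**2
# Set both partial derivatives to zero and solve the linear system for m and b
sp.solve([sp.diff(SE, m_s), sp.diff(SE, b_s)], [m_s, b_s])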
Now, in theory, we have the tools we need to perform linear regression. Let's try it out w/ a custom implementation using our work from above and then compare it to the popular scikit-learn implementation.
We now have a simple closed-form solution for the regression line; let's apply it to our example from the beginning:
In [160]:
x_y_hat = (x * y).mean()
y_hat = y.mean()
x_2_hat = (x**2).mean()
x_hat = x.mean()
m = (x_hat * y_hat - x_y_hat) / ((x_hat)**2 - x_2_hat)  # slope from the closed-form solution
b = y_hat - m * x_hat                                    # intercept
house_regression = lambda x: m*x + b                     # our regression function
(m, b)
Out[160]:
So our regression line is:
In [161]:
print "y=%fx+%f"%(m,b)
Now, let's test it w/ a 10 square foot house (a value outside of our original dataset):
In [162]:
custom_prediction = house_regression(10)
custom_prediction
Out[162]:
In [163]:
from sklearn import linear_model
In [164]:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
# inputs into vector format
_x = x.reshape(len(x),1)
_y = y.reshape(len(y),1)
regr.fit(_x,_y)
# The coefficients
print('Coefficients: \n', regr.coef_)
In [165]:
scikit_prediction = regr.predict(np.array([[10]]))
print(custom_prediction)
print(scikit_prediction[0][0])
As you can see, our custom implementation returns the same answer as the scikit-learn implementation!
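We can also compare the fitted slope and intercept directly; here is a quick sketch using the coef_ and intercept_ attributes that scikit-learn exposes on a fitted LinearRegression:
In [ ]:
# Quick check (sketch): the closed-form slope and intercept should match scikit-learn's.
# Because _y was passed as a column vector, coef_ has shape (1, 1) and intercept_ shape (1,).
print(np.isclose(m, regr.coef_[0][0]))
print(np.isclose(b, regr.intercept_[0]))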
The reason we chose squared error is the nice shape the squared error makes when we graph it against $m$ and $b$. The graph is a 3D paraboloid with the smallest squared error at our optimally chosen $m$ and $b$. Since this surface has only one minimum, we can always find it, and the minimum is unique. If we used higher exponents it would be harder to find the minimum value(s), and we could end up with non-unique minimums or only local minimums (values that look good compared to their neighbouring values but are not the absolute best). So, in summary, we used squared error because it gives us a minimum that is easy to find and is guaranteed to be the only minimum (this guarantees it is the best!).
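To back up the unique-minimum claim numerically, here is a small sketch: the Hessian of the mean-form $SE_{line}$ with respect to $(m, b)$ works out to $2n\begin{pmatrix}\hat{x^2} & \hat{x}\\ \hat{x} & 1\end{pmatrix}$, and nonnegative eigenvalues confirm the surface is convex (its determinant is proportional to the variance of $x$).
In [ ]:
# Sketch: eigenvalues of the Hessian of SE_line with respect to (m, b).
# Nonnegative eigenvalues mean the error surface is convex, so the minimum is unique.
H = 2 * n * np.array([[(x**2).mean(), x.mean()],
                      [x.mean(),      1.0     ]])
np.linalg.eigvals(H)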